fix(eval): use function calling for tool/input mocking so non-OpenAI models work (AE-1646) by Chibionos · Pull Request #1692 · UiPath/uipath-python

Chibionos · 2026-05-29T07:20:57Z

Summary

Tool simulation and input generation in Studio Debug and Evaluation Set runs failed with AGENT_RUNTIME.UNEXPECTED_ERROR for non-OpenAI models (Anthropic Claude via Bedrock, Gemini), but worked with OpenAI/GPT.

Fixes AE-1646 (customer: Sarasota Memorial Health Care System).

Root cause (live-verified on alpha)

Both eval mockers requested structured output via OpenAI-only response_format json_schema and parsed response.choices[0].message.content. The normalized LLM Gateway only honors response_format for OpenAI models — and the failure shape differs per provider (verified by live calls, not assumed):

Provider	`response_format` behavior on the gateway
OpenAI (gpt-4.1-mini)	honored — valid JSON content, native `$defs` support
Claude (sonnet-4-5)	ignored — returns plain prose content (e.g. `'Tokyo'`)
Gemini (2.5-pro)	returns empty content

Both non-OpenAI shapes broke json.loads(...) → wrapped as UiPathMockResponseGenerationError → AGENT_RUNTIME.UNEXPECTED_ERROR. Note the prose case matters: an earlier revision of this PR fell back only on empty content, which fixed Gemini but left Claude — the customer's model — still failing. The live e2e below caught that.
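The two failure shapes in the table can be reproduced without a gateway. The sketch below assumes only that the old mocker path called `json.loads` on the message content; `try_parse` and the three sample strings are illustrative names, not code from the PR.

```python
import json


def try_parse(content: str):
    """Mimic the old mocker path: treat the message content as JSON."""
    try:
        return json.loads(content)
    except json.JSONDecodeError as exc:
        # Return the message so the three shapes can be compared side by side.
        return f"JSONDecodeError: {exc}"


openai_like = '{"city": "Tokyo"}'  # response_format honored: valid JSON
claude_like = "Tokyo"              # response_format ignored: plain prose
gemini_like = ""                   # empty content

for label, content in [
    ("openai", openai_like),
    ("claude", claude_like),
    ("gemini", gemini_like),
]:
    print(label, "->", try_parse(content))
```

Prose and empty content both fail at the first character, which is why a fallback that checks only for empty content misses the prose case. Content of `None` would raise `TypeError` instead and is not covered by this sketch.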

Regression from #1555, which started routing the agent's model into simulations; before that, simulation always used a fixed OpenAI model.

Fix

Provider-aware strategy classes in eval/mocks/_structured_output.py (mirroring llm_as_judge_evaluator, whose docstring already states function calling is the cross-provider way to get structured output):

OpenAIStructuredOutput — prefers response_format (reliable for OpenAI incl. $defs), falls back to a forced tool call on empty/non-JSON content or request error.
AnthropicStructuredOutput / GeminiStructuredOutput — go straight to a forced tool call (their response_format is known-broken on the gateway; no wasted request per simulation).
Unknown providers — response_format first with tool fallback (safe default).
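The routing described above can be sketched roughly as follows. This is not the code in `_structured_output.py`: the function names, the callables standing in for gateway requests, and the model-name substring check are all assumptions made for illustration (the PR does not show how the provider is detected).

```python
import json
from typing import Any, Callable


def response_format_then_tool(
    schema: dict[str, Any],
    via_response_format: Callable[[dict[str, Any]], Any],
    via_tool_call: Callable[[dict[str, Any]], dict[str, Any]],
) -> dict[str, Any]:
    """Prefer response_format; fall back to a forced tool call."""
    try:
        content = via_response_format(schema)
        if content:  # empty/None content falls through to the tool call
            return json.loads(content)  # prose content raises and falls through
    except Exception:
        pass  # request error or non-JSON content: use the fallback
    return via_tool_call(schema)


def tool_only(
    schema: dict[str, Any],
    via_response_format: Callable[[dict[str, Any]], Any],
    via_tool_call: Callable[[dict[str, Any]], dict[str, Any]],
) -> dict[str, Any]:
    """Skip response_format entirely; no wasted request."""
    return via_tool_call(schema)


def pick_strategy(model_name: str):
    """Hypothetical provider detection by model-name substring."""
    name = model_name.lower()
    if "claude" in name or "gemini" in name:
        return tool_only
    # OpenAI and unknown providers share the same chain in this sketch.
    return response_format_then_tool
```

With this shape, a Claude or Gemini model costs exactly one request per simulation, while an unknown provider that ignores `response_format` costs two.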

The forced tool wraps the output/input schema under a response property (tool_choice=required), and reads tool_calls[0].arguments["response"] (already a parsed dict). $defs/$ref are inlined so tool parameters are self-contained (the gateway accepts $defs in response_format but not in tool parameters); sibling keys on $ref nodes (e.g. field descriptions) are merged over the inlined definition so LLM guidance survives. chat_completions now accepts raw-dict tools (pass-through) so arbitrary nested schemas reach the gateway verbatim.
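A minimal sketch of the inlining and wrapping steps, under stated assumptions: `inline_defs` and `build_forced_tool` are illustrative names, the tool dict shape is hypothetical rather than the gateway's documented format, and cyclic references simply raise here (the PR handles them differently, keeping their `$defs`).

```python
from typing import Any

_PREFIX = "#/$defs/"


def inline_defs(schema: dict[str, Any]) -> dict[str, Any]:
    """Return a new schema with local $defs references inlined."""
    defs = schema.get("$defs", {})

    def resolve(node: Any, seen: tuple[str, ...]) -> Any:
        if isinstance(node, list):
            return [resolve(item, seen) for item in node]
        if not isinstance(node, dict):
            return node
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith(_PREFIX):
            name = ref[len(_PREFIX):]
            if name in seen:
                raise ValueError(f"cyclic $ref not supported in this sketch: {name}")
            if name not in defs:
                return dict(node)  # unknown ref: leave untouched
            merged = resolve(defs[name], seen + (name,))
            # Sibling keys on the $ref node (e.g. description) win over the definition.
            siblings = {k: resolve(v, seen) for k, v in node.items() if k != "$ref"}
            return {**merged, **siblings}
        return {k: resolve(v, seen) for k, v in node.items()}

    # Drop the root-level $defs; everything reachable has been inlined.
    root = {k: v for k, v in schema.items() if k != "$defs"}
    return resolve(root, ())


def build_forced_tool(name: str, schema: dict[str, Any]) -> dict[str, Any]:
    """Wrap the schema under a single required `response` property."""
    return {
        "name": name,  # hypothetical tool dict shape
        "description": "Return the structured result.",
        "parameters": {
            "type": "object",
            "properties": {"response": inline_defs(schema)},
            "required": ["response"],
        },
    }
```

The caller's schema is never mutated: every dict and list on the path is rebuilt, so the same Pydantic-generated schema can still be sent as `response_format` on the OpenAI path.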

E2E validation (live against alpha gateway)

Full customer path per model — @mockable → Mocker.response → generate_structured_output → real gateway → Pydantic coercion of the result. Tool simulation used a nested return model (enum + list of sub-models ⇒ real $defs/$ref); input generation ran generate_llm_input the same way.

Model	Tool simulation (nested `$defs`)	Input generation	Calls
gpt-4.1-mini-2025-04-14	✅ coerced to `Calculation`, total=42.0	✅	1 (`response_format`)
anthropic.claude-sonnet-4-5-20250929-v1:0	✅ coerced to `Calculation`, total=42.0	✅	1 (tool call)
gemini-2.5-pro	✅ coerced to `Calculation`, total=42.0	✅	1 (tool call)

Negative control: reverting the non-JSON-content fallback and re-running the Claude e2e reproduces the exact customer error (UiPathMockResponseGenerationError: Expecting value: line 1 column 1 (char 0)), confirming the e2e discriminates.

Tests

Strategy routing: Claude/Gemini → single forced tool call, no response_format; OpenAI → response_format preferred; unknown → fallback chain (incl. prose-content, empty-content, and request-error triggers).
$defs inlining: self-contained output, sibling-key merge, cyclic-ref handling, caller schema not mutated.
test_raw_dict_tool_passthrough_mocked (platform): nested array schema forwarded byte-for-byte.
Full tests/cli/eval suite (405) + platform mocked LLM tests green; ruff + mypy clean on both packages.

Review feedback addressed

@mjnovice — separate per-provider classes: done (OpenAIStructuredOutput, AnthropicStructuredOutput, GeminiStructuredOutput, shared ToolCallStructuredOutput base).
Copilot — description/implementation mismatch: resolved by the provider strategies (this description now matches); chat_completions docstring updated for raw-dict tools.

🤖 Generated with Claude Code

github-actions · 2026-05-29T07:53:00Z

🚨 Heads up: `uipath-langchain` cross-tests are FAILING 🚨

Your changes may break the uipath-langchain-python integration.

⚠️ These checks are NOT enforced by branch protection rules. Please review the failures before merging.

🔍 Inspect the failed run →

Chibionos · 2026-05-29T08:28:48Z

Re: the uipath-langchain cross-test heads-up above — that was from an earlier commit. After the adaptive fix (prefer response_format, fall back to function calling only when content is empty), the cross-tests pass on the latest commit: langchain-cross / {alpha,cloud,staging} and test-uipath-langchain (3.11/3.12/3.13 × ubuntu/windows) are all green. The heads-up comment is stale and can be disregarded.

Full suite is green: all eval test-cases (calculator-evals, simulation-testcase, tools-evals, etc.) pass across alpha/cloud/staging; lint, SonarCloud, build all pass.

mjnovice · 2026-06-03T01:04:06Z

+logger = logging.getLogger(__name__)
+
+
+def _inline_defs(


Can we have separate classes for how we are doing Anthropic, OpenAI, Gemini, etc.?

Chibionos · 2026-06-11

Done in a8a3307. Split into per-provider strategy classes: OpenAIStructuredOutput (response_format preferred, tool fallback), AnthropicStructuredOutput and GeminiStructuredOutput (straight to a forced tool call, since their response_format is broken on the gateway; live-verified: Claude returns prose, Gemini returns empty content), with a shared ToolCallStructuredOutput base. This also avoids the wasted known-dead request per simulation for non-OpenAI models. E2E results on all 3 models are in the updated PR description.

mjnovice

Minor comment about making generate_structured_output more modular.

Copilot

Pull request overview

This PR addresses eval tool/input simulation failures for non-OpenAI models by introducing a provider-agnostic structured-output helper and updating the eval mockers and LLM gateway integration to support function-calling style structured responses (including nested schemas via raw-dict tool passthrough).

Changes:

Add generate_structured_output() helper to prefer response_format when available and fall back to a forced tool call when content is empty/unsupported.
Update eval LLM mocker and input mocker to use the shared structured-output helper, and adjust/extend unit tests for the new behavior (including non-OpenAI fallback).
Update UiPathLlmChatService.chat_completions() to accept raw-dict tools (pass-through) so nested JSON-schema tool parameters are preserved; bump package versions accordingly.
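The raw-dict pass-through in the last item can be pictured with a small sketch. `FlatTool` and `to_request_tools` are hypothetical stand-ins, not the `ToolDefinition` type or the service code from `uipath-platform`; the only behavior taken from the PR is that dict tools are forwarded unchanged while typed definitions go through a converter that emits flat properties.

```python
from dataclasses import dataclass, field
from typing import Any, Union


@dataclass
class FlatTool:
    """Hypothetical typed tool definition with flat parameters only."""
    name: str
    description: str
    properties: dict[str, str] = field(default_factory=dict)  # param name -> JSON type


def to_request_tools(tools: list[Union[FlatTool, dict[str, Any]]]) -> list[dict[str, Any]]:
    """Build the `tools` list for a request body."""
    out: list[dict[str, Any]] = []
    for tool in tools:
        if isinstance(tool, dict):
            # Raw dict: forward verbatim so nested schemas survive.
            out.append(tool)
        else:
            # Typed definition: the converter only knows flat properties.
            out.append({
                "name": tool.name,
                "description": tool.description,
                "parameters": {
                    "type": "object",
                    "properties": {p: {"type": t} for p, t in tool.properties.items()},
                },
            })
    return out
```

Passing the dict through by reference rather than rebuilding it is what lets a test assert that a nested array schema reaches the request body unchanged.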

Reviewed changes

Copilot reviewed 10 out of 12 changed files in this pull request and generated 2 comments.

File	Description
packages/uipath/uv.lock	Bumps `uipath` and `uipath-platform` locked versions.
packages/uipath/pyproject.toml	Bumps `uipath` version and raises minimum `uipath-platform` dependency.
packages/uipath-platform/uv.lock	Bumps `uipath-platform` locked version.
packages/uipath-platform/pyproject.toml	Bumps `uipath-platform` version.
packages/uipath-platform/src/uipath/platform/chat/_llm_gateway_service.py	Allows passing raw-dict tools through to the normalized gateway request body.
packages/uipath-platform/tests/services/test_uipath_llm_integration.py	Adds coverage ensuring raw-dict tool schemas are forwarded unchanged and tool_choice serialization works.
packages/uipath/src/uipath/eval/mocks/_structured_output.py	New shared helper for structured output with response_format-first + tool-call fallback; includes schema `$defs` inlining logic.
packages/uipath/src/uipath/eval/mocks/_llm_mocker.py	Switches mock response generation to the shared structured-output helper; improves error propagation.
packages/uipath/src/uipath/eval/mocks/_input_mocker.py	Switches input generation to the shared structured-output helper.
packages/uipath/tests/cli/eval/mocks/test_structured_output.py	New unit tests validating schema wrapping/inlining and response extraction/fallback behavior.
packages/uipath/tests/cli/eval/mocks/test_mocks.py	Updates existing mocks and adds a non-OpenAI fallback regression test (AE-1646).
packages/uipath/tests/cli/eval/mocks/test_input_mocker.py	Adds assertion that OpenAI path uses `response_format` without tool fallback.

Comments suppressed due to low confidence (1)

packages/uipath/src/uipath/eval/mocks/_input_mocker.py:158

generate_llm_input() no longer surfaces a clear JSON-parsing error when the LLM returns invalid JSON (the previous code raised UiPathInputMockingError("Failed to parse LLM response as JSON: ...")). Now a json.JSONDecodeError from generate_structured_output() is wrapped as a generic "Failed to generate input" error, which makes debugging structured-output failures harder.

    except UiPathInputMockingError:
        raise
    except Exception as e:
        raise UiPathInputMockingError(f"Failed to generate input: {str(e)}") from e
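One way to address the concern above is to catch the JSON error before the generic handler so the parse-specific message survives. This is only a sketch of that idea, not what the PR does: `UiPathInputMockingError` is redefined locally as a stand-in, and `generate_input` with its `produce` callable is a hypothetical wrapper.

```python
import json
from typing import Any, Callable


class UiPathInputMockingError(Exception):
    """Local stand-in for the real exception class."""


def generate_input(produce: Callable[[], Any]) -> Any:
    """Run `produce` and map failures to UiPathInputMockingError."""
    try:
        return produce()
    except UiPathInputMockingError:
        raise
    except json.JSONDecodeError as e:
        # More specific handler first: keeps the old, clearer message.
        raise UiPathInputMockingError(
            f"Failed to parse LLM response as JSON: {e}"
        ) from e
    except Exception as e:
        raise UiPathInputMockingError(f"Failed to generate input: {e}") from e
```

Because `json.JSONDecodeError` is a subclass of `ValueError`, it must be listed before the `except Exception` clause to take effect.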

+"""Provider-agnostic structured output for the eval mockers.
+
+The normalized LLM Gateway honors OpenAI-style ``response_format`` (json_schema)
+only for OpenAI models — and does so reliably, including native ``$defs``
+support. Non-OpenAI providers (Anthropic/Claude via Bedrock, Gemini) return such
+requests with ``choices[0].message.content`` empty/None, which breaks JSON
+parsing. Function calling is honored across providers but is less reliable for
+OpenAI on some schemas, so it is used only as a fallback: prefer
+``response_format`` and fall back to a forced tool call when the content comes
+back empty.


        presence_penalty: float = 0,
        top_p: float | None = 1,
        top_k: int | None = None,
-        tools: list[ToolDefinition] | None = None,
+        tools: list[ToolDefinition | dict[str, Any]] | None = None,
        tool_choice: ToolChoice | None = None,
        response_format: dict[str, Any] | type[BaseModel] | None = None,


…models work
Tool simulation and input generation in Studio Debug and Evaluation Set runs failed with AGENT_RUNTIME.UNEXPECTED_ERROR for non-OpenAI models (Anthropic Claude via Bedrock, Gemini). The mockers requested structured output via OpenAI-only `response_format` json_schema and parsed `choices[0].message.content`; for Claude that content is empty/None, so `json.loads(...)` raised.
Switch both mockers to provider-agnostic function calling (mirrors llm_as_judge_evaluator): build a forced tool that wraps the output/input schema under a `response` property, force it via tool_choice, and read `tool_calls[0].arguments["response"]` (already a parsed dict). Hoist nested `$defs` to the tool-parameters root so `$ref`s from nested Pydantic models still resolve.
The normalized LLM gateway now accepts raw-dict tools so arbitrary nested schemas survive (the ToolDefinition converter only emits flat properties).
Regression introduced by #1555, which started routing the agent's model into simulations; before that, simulation always used a fixed OpenAI model, so non-OpenAI providers were never exercised on this path.
Fixes AE-1646.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The mocker fix in uipath depends on the dict-tool passthrough in uipath-platform, so uipath's lower-bound pin is raised to 0.1.62.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ation
The normalized gateway accepts $ref/$defs in response_format but not inside a tool's parameters. Tool outputs typed as nested Pydantic models/enums (e.g. calculator's get_random_operator -> Wrapper[Operator]) produced a tool schema with $ref/$defs that the gateway rejected, so simulation failed. Inline the definitions into a self-contained schema (cyclic refs keep their $defs).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

All-tool-calling regressed OpenAI tool simulation (calculator-evals 'Test Random Addition Using LLM' became flaky: gpt_4_1_mini returned wrong/empty values for a nested-enum output schema via function calling, where response_format was reliable).
Make structured-output generation adaptive: prefer response_format (honored reliably by OpenAI, native $defs support) and fall back to a forced tool call only when content comes back empty (the non-OpenAI failure mode, e.g. Claude/Bedrock). Shared in generate_structured_output(), used by both mockers.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ON prose
Live-verified on alpha: Claude answers response_format requests with plain prose (not empty content), so the empty-content check alone never triggered the fallback and AE-1646 persisted for Claude.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Address review feedback: separate classes per provider (OpenAI prefers response_format with tool fallback; Claude and Gemini go straight to a forced tool call, avoiding a known-dead request per simulation). Merge $ref sibling keys when inlining so field descriptions survive, and document raw-dict tools in chat_completions.
Live-verified on alpha for gpt-4.1-mini, claude-sonnet-4-5 and gemini-2.5-pro: tool simulation (nested $defs schema) and input generation pass end-to-end through the @mockable pipeline.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

sonarqubecloud · 2026-06-11T04:42:59Z

Quality Gate passed

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
98.9% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Chibionos mentioned this pull request May 29, 2026

fix(eval): use function calling for tool/input mocking so non-OpenAI models work (AE-1646) #1691

Closed

github-actions Bot added the test:uipath-langchain (triggers tests in the uipath-langchain-python repository) and test:uipath-integrations labels May 29, 2026

Chibionos force-pushed the fix/ae-1646-mocker-non-openai-models branch from 5e4bdb2 to ae78cbe May 29, 2026 08:06

mjnovice reviewed Jun 3, 2026

mjnovice approved these changes Jun 3, 2026

Copilot AI review requested due to automatic review settings June 8, 2026 17:47

Chibionos force-pushed the fix/ae-1646-mocker-non-openai-models branch from b4954be to 5bddc0f June 8, 2026 17:47

Copilot started reviewing on behalf of Chibionos June 8, 2026 17:47

Copilot AI reviewed Jun 8, 2026

Chibionos enabled auto-merge (squash) June 10, 2026 23:45

Chibionos and others added 7 commits June 10, 2026 21:33

chore: bump uipath to 2.10.79 and uipath-platform to 0.1.62

40f29d0

The mocker fix in uipath depends on the dict-tool passthrough in uipath-platform, so uipath's lower-bound pin is raised to 0.1.62. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test(eval): add explicit type params to _FakeLLM for mypy

94d9629

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Chibionos force-pushed the fix/ae-1646-mocker-non-openai-models branch from 5bddc0f to a8a3307 June 11, 2026 04:39

Chibionos merged commit 13bc71d into main Jun 11, 2026
123 checks passed

Chibionos deleted the fix/ae-1646-mocker-non-openai-models branch June 11, 2026 04:41
